Apache Spark
Libraries
- pyspark.sql.functions: provides the Spark SQL functions (col, when, split, ...)
- pyspark.sql.types: provides the Spark data types (StructType, IntegerType, ...)
Resilient Distributed Dataset (RDD)
- Low-level distributed collection that splits the data across the machines in the cluster
Datasets
- Built on top of RDDs
- catalog: SparkSession attribute with functions to inspect the tables registered in the session
- listTables: lists all tables in the catalog
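A minimal sketch of inspecting the catalog (assumes a SparkSession named spark):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("notes").getOrCreate()

    # listTables() returns the tables and temp views visible in the current database
    for table in spark.catalog.listTables():
        print(table.name, table.tableType, table.isTemporary)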
RDD
Transformation functions
filter
- df.filter(df.dest == "asdfg")
map
flatMap
union
- concatenates two RDDs by position (not by schema), so we need to be careful with the column order
reduceByKey: combines the values that share the same key
groupByKey: groups the values that share the same key
sortByKey: sorts a pair RDD by key
join: joins two pair RDDs by key (see the sketch below)
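A sketch of these transformations on a made-up pair RDD of (airport, delay), reusing spark from above:

    pairs = spark.sparkContext.parallelize([("SEA", 10), ("JFK", 5), ("SEA", 3)])

    kept    = pairs.filter(lambda kv: kv[0] == "SEA")            # keep only SEA records
    doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))           # transform each element
    words   = spark.sparkContext.parallelize(["a b", "c"]).flatMap(lambda s: s.split(" "))
    both    = pairs.union(doubled)                               # concatenates by position
    totals  = pairs.reduceByKey(lambda a, b: a + b)              # sum delays per airport
    grouped = pairs.groupByKey()                                 # iterable of delays per airport
    by_key  = pairs.sortByKey()                                  # sort pair RDD by key
    joined  = pairs.join(spark.sparkContext.parallelize([("SEA", "west")]))  # join by key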
Read/write operations
- read/write
- csv
- json
- parquet
- options
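A sketch of read/write with options (the paths and options here are only illustrative):

    df = (spark.read
          .option("header", "true")         # CSV option: first line holds the column names
          .option("inferSchema", "true")    # CSV option: guess the column types
          .csv("/data/flights.csv"))

    df.write.mode("overwrite").parquet("/data/flights_parquet")
    df.write.mode("overwrite").json("/data/flights_json")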
Action Functions
- collect: returns all elements to the driver
- take: returns the first n elements
- first: returns the first element
- count
- repartition: reshuffles the whole dataset into the given number of partitions (a transformation, not an action)
- df.repartition(5, col("country"))
- coalesce: reduces the number of partitions by merging existing ones
- useful to avoid a full reshuffle
- broadcast: ships a read-only copy of a small dataset to every worker, e.g. for broadcast joins (see the sketch below)
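A sketch reusing pairs and df from the sketches above; the "country" column and the small lookup DataFrame are assumptions:

    from pyspark.sql.functions import col, broadcast

    pairs.collect()   # all elements back to the driver (careful with large data)
    pairs.take(3)     # first 3 elements
    pairs.first()     # first element
    pairs.count()     # number of elements

    df5 = df.repartition(5, col("country"))   # full shuffle into 5 partitions by country
    df1 = df5.coalesce(1)                     # merge partitions without a full shuffle

    # broadcast join: ship the small DataFrame to every executor
    small_df = spark.createDataFrame([("PT", "Portugal")], ["country", "country_name"])
    joined_df = df.join(broadcast(small_df), "country")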
DataFrames
Transformation functions
- select
- filter/where
- groupBy
- orderBy/sort
- can take asc, desc, asc_nulls_first, desc_nulls_first, asc_nulls_last, desc_nulls_last as arguments
- sortWithinPartitions: sorts within each partition; can improve performance of later transformations
- dropDuplicates
- drop: removes one or more columns
- withColumn: creates a new column (or replaces one with the same name)
- withColumnRenamed: renames a column in the dataframe
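A sketch chaining these transformations, reusing df from the read sketch (column names are made up):

    from pyspark.sql.functions import col, desc

    busiest = (df
               .select("origin", "dest", "delay")
               .filter(col("delay") > 0)                    # same as .where(...)
               .dropDuplicates()
               .groupBy("origin")
               .count()
               .orderBy(desc("count"))                      # or .sort(...)
               .withColumn("is_busy", col("count") > 100)   # new derived column
               .withColumnRenamed("count", "n_flights")
               .drop("is_busy"))

    # sortWithinPartitions: sorts inside each partition, no global shuffle
    partly_sorted = df.sortWithinPartitions("origin")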
Action Functions
- printSchema
- head
- show
- count
- columns
- describe
- corr: gets the correlation between two numeric columns
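A sketch of these actions, reusing df; "delay" and "distance" are assumed numeric columns:

    df.printSchema()                 # column names and types
    df.head(5)                       # first 5 rows as Row objects
    df.show(5, truncate=False)       # pretty-print the first 5 rows
    df.count()                       # number of rows
    df.columns                       # list of column names (an attribute, not an action)
    df.describe("delay").show()      # count, mean, stddev, min, max
    df.corr("delay", "distance")     # Pearson correlation between two numeric columns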
SQL Functions
- createOrReplaceTempView("name")
- sql(query)
- contains
- isNull/isNotNull
- upper
- split
- cast: converts a column to another data type
- array columns:
- size: number of elements in the array
- getItem: gets the element at a given index
- when: conditional expression (like an if)
- otherwise: value used when no when condition matches
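A sketch combining these column functions, reusing df; the origin, dest, delay and route_str columns are assumptions:

    from pyspark.sql.functions import col, upper, split, size, when

    df.createOrReplaceTempView("flights")
    spark.sql("SELECT origin, COUNT(*) AS n FROM flights GROUP BY origin").show()

    enriched = (df
        .filter(col("dest").contains("SE"))
        .filter(col("delay").isNotNull())
        .withColumn("origin_upper", upper(col("origin")))
        .withColumn("route", split(col("route_str"), "-"))      # string -> array
        .withColumn("n_stops", size(col("route")))              # array length
        .withColumn("first_stop", col("route").getItem(0))      # element by index
        .withColumn("delay_int", col("delay").cast("int"))
        .withColumn("status", when(col("delay") > 15, "late").otherwise("on time")))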
Functions
- monotonically_increasing_id: generates unique, monotonically increasing 64-bit IDs (not necessarily consecutive)
- cache: keeps the DataFrame in memory after the first action computes it
- unpersist: removes the DataFrame from the cache
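A sketch of these helpers, reusing df:

    from pyspark.sql.functions import monotonically_increasing_id

    with_id = df.withColumn("row_id", monotonically_increasing_id())

    with_id.cache()      # keep the DataFrame in memory after the first action
    with_id.count()      # an action that materializes the cache
    with_id.unpersist()  # release the cached data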
Data types (pyspark.sql.types)
- ByteType
- ShortType
- IntegerType
- LongType
- FloatType
- DoubleType
- DecimalType
- StringType
- BinaryType
- BooleanType
- TimestampType
- DateType
- ArrayType
- MapType
- StructType
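A sketch building an explicit schema from these types (field names and path are made up):

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   IntegerType, TimestampType, ArrayType)

    schema = StructType([
        StructField("origin", StringType(), True),
        StructField("delay", IntegerType(), True),
        StructField("departure", TimestampType(), True),
        StructField("stops", ArrayType(StringType()), True),
    ])

    typed_df = spark.read.schema(schema).json("/data/flights_json")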
Processes
- Driver: runs the main program and coordinates the work
- Executor: runs the tasks and holds the cached data partitions
- Cluster manager: allocates resources for the application
- Standalone: Spark's own simple cluster manager
- Apache Mesos (deprecated)
- YARN: Hadoop's resource manager (since Hadoop 2)
- Kubernetes
Execution modes
cluster: the driver runs on a node inside the cluster
client: the driver runs on the machine that submitted the application
local: driver and executors run in a single process on one machine
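A sketch of choosing where the driver runs; on a real cluster the master and deploy mode are usually passed to spark-submit instead:

    from pyspark.sql import SparkSession

    # local mode: driver and executors share one JVM on this machine
    spark = (SparkSession.builder
             .master("local[*]")     # use all local cores
             .appName("notes")
             .getOrCreate())

    # on a cluster, master and deploy mode usually go through spark-submit, e.g.:
    #   spark-submit --master yarn --deploy-mode cluster app.py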
Job
- stages: sets of tasks separated by shuffle boundaries; all tasks in a stage run the same computation on different partitions
- tasks: the smallest unit of computation; each task processes one partition on a single executor
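A quick way to see the stages a query will need, reusing df: explain() prints the physical plan, where each Exchange marks a shuffle boundary and therefore a new stage:

    # groupBy forces a shuffle, so the resulting job has two stages
    df.groupBy("origin").count().explain()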